[WIP] Boundary Fields #159

Open · wants to merge 2 commits into master
Conversation

@george-silva commented Jan 27, 2017

Goals

  • Add a boundary field for all the models that might have boundaries;
  • Add a mechanism to import boundary/GIS data to the correct places;

Considerations

  1. The boundary field is nullable and of type MultiPolygon, so it can accurately represent territories made up of multiple parts (islands), such as Japan (a minimal field sketch follows this list);
  2. The boundary field was added to the Place model, so all inheritors have the field as well.
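
A minimal sketch of what the field could look like, assuming GeoDjango and an abstract Place base model; the field name "boundary" matches the description above, everything else is illustrative:

from django.contrib.gis.db import models

class Place(models.Model):
    name = models.CharField(max_length=200)  # assumed existing field, for illustration only

    # Nullable MultiPolygon so multi-part territories (e.g. Japan) can be represented
    boundary = models.MultiPolygonField(null=True, blank=True)

    class Meta:
        abstract = True  # assumption: Place is an abstract base shared by Country, Region, etc.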

TODO

  1. Fields added to the models;
  2. Changed ROOT_URLCONF from test_project.urls to test_app.urls - when running migrations, test_app could not find test_project.urls, so this is something of a workaround;
  3. Changed WSGI_APPLICATION from test_project.wsgi.application to test_app.wsgi.application - also something of a workaround;
  4. Determine the best source for GIS data;
  5. Change the import command so it can handle boundary data;

Edit (by blag): Changed checklists into GitHub-flavored Markdown TODO list so it gets a progress bar in the PR list page

@blag (Collaborator) commented Jan 27, 2017

I would use the Geonames shapes_simplified_low.zip file for country boundaries. It has two tab-separated columns, geonameid and geojson, so the existing get_data function should handle it just fine. You can deserialize the GeoJSON with django-geojson or geojson.
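
A rough sketch of consuming that file, assuming the two columns described above; the exact column names and the helper name are illustrative, and GEOSGeometry's GeoJSON parsing needs GDAL (which we already require):

import csv
from django.contrib.gis.geos import GEOSGeometry, MultiPolygon

def iter_country_boundaries(path):
    # Yield (geonameid, MultiPolygon) pairs from the Geonames shapes TSV.
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter='\t')
        for row in reader:
            geom = GEOSGeometry(row['geojson'])  # GeoJSON string -> GEOS geometry
            if geom.geom_type == 'Polygon':
                geom = MultiPolygon(geom)        # normalize to match a MultiPolygonField
            yield int(row['geonameid']), geom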

Importing other sources of boundary data is probably going to be more involved. I'll look into that.

@blag (Collaborator) commented Jan 27, 2017

I thought I had configured Travis to run our tests on all pull requests, but that option was turned off.

I've turned it back on. If you push anything more to this pull request, it should run the tests automatically against Python 2.7 and 3.3-3.6, and Django 1.7-1.10.

I'll be adding tests with Django 1.11 Real Soon Now (tm); no later than its official release.

@blag (Collaborator) commented Jan 27, 2017

Closing and reopening to try to kick off a Travis run.

@blag closed this Jan 27, 2017
@blag reopened this Jan 27, 2017
@blag (Collaborator) commented Jan 28, 2017

@george-silva If you don't want to push any more commits, but want to run Travis tests, it may work if you close and reopen this PR. I don't really have time right now to debug why Travis isn't working, but it's something I'll focus on fixing tonight or tomorrow.

@blag (Collaborator) commented Jan 29, 2017

This repo has some of the GeoJSON files we need for the US:

https://github.com/jgoodall/us-maps/tree/master/geojson

The original source for those files also has information for "Urban Areas" and "Consolidated Cities" from the 2000 & 2010 US census:

http://www.census.gov/geo/maps-data/data/tiger-line.html

I'm still looking around for sources for info for other countries.

@george-silva (Author) commented

Hello @blag!

I think the geonames source for countries will work out just fine.

I'll check today if OSM has the state/city data.

Thanks for the tip regarding Travis. I'll keep an eye on it.

@blag (Collaborator) commented Jan 30, 2017

@george-silva Sorry, I wasn't clear: yep, I agree, let's use the Geonames data for countries, period.

For boundaries of country subdivisions (e.g. regions and below), I would also like to use OSM data wherever possible - it's comprehensive (international), highly precise, clearly licensed, and legally unencumbered.

Check out these per-country boundary files:

https://mapzen.com/data/borders/

Using those per-country dumps, we could import boundaries only for selected countries; or, if all boundaries are chosen, we could download the entire planet file.

The OSM wiki has a good explanation on administrative levels:

https://wiki.openstreetmap.org/wiki/Tag:boundary%3Dadministrative

and if I'm reading that correctly, it means we could pick out boundaries for regions, subregions, cities, and districts from a single boundary file.

I'm still not sure where we could get postal code areas for countries. That's where that source comes in. The zcta5.json file has exactly what we want, but only for the US. Finding postal code boundaries for every other country might be more challenging.

@george-silva (Author) commented Jan 30, 2017

@blag well, I guess the combination of the OSM wiki + Mapzen's data will suffice.

The hard part is organizing the data that is in the wiki.

Here's my proposal for this:

We have:

  • Continents;
  • Countries;
  • Regions;
  • Cities;
  • Postal areas.

My proposal:

  1. Ignore postal areas for now.
  2. Continents and countries are consistent. We can use OSM data and I think it will be easy to download/use;
  3. For all the other levels (region/city), we use OSM data from Mapzen. It will be easy to replicate their infrastructure if they decide to take it down later on.
  4. We'll create a dict mapping in our code to specify which model corresponds to each boundary type in OSM.

Something like:

CITIES_BOUNDARY_MAPPING = {
    # country code, list of administrative levels that correspond with our proposed boundary
    'bra': [4,8],
    'foo': [3,6]
}

# or

CITIES_BOUNDARY_MAPPING = {
    # country code, dict mapping our boundary types to OSM administrative levels
    'bra': {'region': 4, 'city': 8},
    # etc
}

If you check the OSM wiki docs, you can see that in Brazil admin_level 4 corresponds to states and 8 to cities.

This way we can download and use the correct data, and we don't need to map out all of the countries upfront. We can let other users add their own mappings. And if it's a setting, they don't even need to open a PR; they can configure it in their project.

The downside of this approach is that we require an extra configuration step, and if a user wants to use different administrative divisions (in Brazil's case, regions, states and macro-regions) we won't be able to support it.
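
A hypothetical helper showing how the second form above could be consumed at import time; the setting, the defaults, and the function are assumptions for illustration, not existing API:

# Hypothetical defaults; unmapped countries fall back to these admin levels.
DEFAULT_BOUNDARY_LEVELS = {'region': 4, 'city': 8}

CITIES_BOUNDARY_MAPPING = {
    'bra': {'region': 4, 'city': 8},
}

def admin_level_for(country_code, place_type):
    """Return the OSM admin_level to import for a country/place type pair."""
    mapping = CITIES_BOUNDARY_MAPPING.get(country_code.lower(), DEFAULT_BOUNDARY_LEVELS)
    return mapping.get(place_type)

# e.g. admin_level_for('bra', 'city') -> 8; admin_level_for('foo', 'region') -> 4 (default)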

What do you think about that?

@blag (Collaborator) commented Jan 30, 2017

That all sounds good. I would like to make sure we have good defaults, to minimize the number of options people have to change.

@blag (Collaborator) commented Jan 30, 2017

We may be able to use Zillow's data for District objects in the US:

http://www.zillow.com/howto/api/neighborhood-boundaries.htm

Although the CC-BY-SA 3.0 license may not work for some of our users.

I'm not trying to focus just on the US, but it doesn't seem that there is high quality corresponding data for other countries.

@blag (Collaborator) commented Jan 30, 2017

And this may work for cities:

http://www.gisgraphy.com/

Openstreetmap data extract by country

  • ...
  • Extracted the shape of more than 160,000 cities and localities from Quattroshapes with their associated geonames id

@george-silva (Author) commented Feb 1, 2017

@blag Quattroshapes is interesting.

I've downloaded Quattroshapes and Mapzen's country data to check them out.

Findings:

  1. Why is it so important to have a geonames id? From what I've seen in the models, we don't store them. Is this the code model attribute that varies from model to model? If so, it might be fine.
  2. Mapzen's data is perfect. The problem is matching it against the data we already download. For Brazilian states, the ref tag was the one used to carry the state code; I'm not sure if that holds true for every country. The schema we discussed earlier would work perfectly, though it might need to be a dict of dicts, where the user specifies the admin level and the field used to join the two datasets (Quattroshapes might need the same setting) - see the sketch after this list;
  3. Mapzen's data can be downloaded per country;
  4. Quattroshapes means that for countries, states and cities we need to download 3 shapefiles (admin0, admin1 and admin2). They need to be downloaded in full, but we can filter at import time if the user/dev only wants a single country;
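
A sketch of the "dict of dicts" idea from point 2, purely illustrative (the setting name follows the earlier proposal; the keys and values are assumptions):

CITIES_BOUNDARY_MAPPING = {
    'bra': {
        # OSM admin_level to import, plus the source attribute used to join
        # against the records we already imported from Geonames
        'region': {'admin_level': 4, 'join_on': 'ref'},   # "ref" carries the state code for Brazil
        'city': {'admin_level': 8, 'join_on': 'name'},
    },
}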

Which data is best

  1. They pretty much look the same, but OSM data might be updated more frequently.
  2. Quattroshapes data is smaller: ~300 MB zipped, while Brazil alone is 48 MB from Mapzen;
  3. Quattroshapes needs to be downloaded in full;
  4. OSM data is updated more often;

Import strategies

  1. We require PostGIS, so GDAL is also available. That means we have the LayerMapping utility; we only need to map the ID in question and the geometry field (see the sketch after this list);
  2. Regardless of the boundary data source, filtering must be done against the data already imported from Geonames. I imagine looping over each selected country's child entities and running the import process for each.
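
A hedged sketch of point 1, using GeoDjango's LayerMapping; the model, import path, and source attribute names are assumptions, not verified against the actual shapefile schema:

from django.contrib.gis.utils import LayerMapping
from cities.models import Region  # assumed model/import path

# left side: model fields; right side: attributes in the boundary layer (assumed)
region_mapping = {
    'code': 'ref',
    'boundary': 'MULTIPOLYGON',
}

def import_region_boundaries(shapefile_path):
    lm = LayerMapping(Region, shapefile_path, region_mapping, transform=True)
    lm.save(strict=False, verbose=True)

Note that LayerMapping inserts new records rather than updating existing ones, so in practice we may need to match on the code and update the already-imported rows instead.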

Suggestions? Considerations?

I wanted to look at the options first before writing any code.

Just to be clear, my preference is OSM/Mapzen. It will be trickier, but I think it's a good source, configurable, etc.

@george-silva (Author) commented

@blag this might be better (also from Mapzen): https://whosonfirst.mapzen.com/

@nvkelso commented Feb 2, 2017

Who's On First is the successor to Quattroshapes and includes neighbourhoods and postal codes. Mapzen has multiple staff working on the project, and it's seen huge progress over the last 18 months. There are multiple download options (metafiles, bundles).

Please let us know how we can be of help.

@blag (Collaborator) commented Feb 3, 2017

@george-silva

We import the geonameid as the primary key for continents, countries, regions, subregions, cities, and alternative names. For some reason that doesn't hold true for districts, even though the data source includes that information. At this point I want to stay backwards compatible for existing users, but I have been thinking of eventually creating a release that isn't backwards compatible, and one of the changes I want to make is to use their geonameid as their id.

The rest sounds good.

@nvkelso Awesome, thanks for all of your hard work! It would help us if you included the geonameids in your data, and separated your data by country so we only need to download/import the minimum amount of data.

@nvkelso commented Feb 3, 2017 via email

@george-silva (Author) commented

OK, I've managed to understand Who's On First's data.

What we'll need to do here (a rough sketch follows below):

  1. Download the CSV metafile for the places of interest. We'll need to filter on common attributes, so the best bet here is country ISO codes;
  2. Once we download the metafile (it needs to be downloaded in full), we can capture the WOF ID and URL for that country;
  3. We load the country boundary via the API (since we are filtering per country, I guess that's the easiest way - we could also download via AWS);
  4. We download the metafiles for the other layers of interest (state/region and county);
  5. Filter those metafiles based on the ID obtained in the first step and grab a list of URLs where the actual data is located;
  6. Loop over the records, grab each one with a GeoJSON serializer/deserializer, filter the current database data based on geonames, and update.

Quite involved process 😄
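
A very rough sketch of that flow; the metafile column names, base URL, and helper names are assumptions for illustration only:

import csv
import json
import requests
from django.contrib.gis.geos import GEOSGeometry

WOF_DATA_ROOT = 'https://whosonfirst.mapzen.com/data/'  # assumed base URL

def wof_rows_for_country(metafile_path, iso_code):
    # Yield (wof_id, relative path) for one country from a WOF CSV metafile.
    with open(metafile_path) as fh:
        for row in csv.DictReader(fh):
            if row.get('iso_country', '').upper() == iso_code.upper():  # assumed column name
                yield row['id'], row['path']

def fetch_boundary(relative_path):
    # Download a single WOF GeoJSON record and return its geometry.
    resp = requests.get(WOF_DATA_ROOT + relative_path, timeout=30)
    resp.raise_for_status()
    feature = resp.json()
    return GEOSGeometry(json.dumps(feature['geometry']))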

I'll start some new modules to do all this work.

@nvkelso commented Feb 3, 2017

We're also experimenting with "bundles" per placetype (but downloads are for the entire planet):

@blag (Collaborator) commented Feb 4, 2017

@george-silva Sounds good, thanks for taking this on. ❤️

@coderholic Do you have any advice/recommendations for us?

@george-silva (Author) commented

Hello guys. I'll be at a customer's office and this might take a while to get done.

I'm still up for it, but this week might be a little busy.

@blag closed this Feb 14, 2017
@blag reopened this Feb 14, 2017
@blag (Collaborator) commented Feb 14, 2017

Whoops, I hit the wrong button there.

Sorry I haven't been attentive lately - job interviews. I should have some free time to check this out next week.

@george-silva (Author) commented

@blag no problem. I'll get back to this next week. These two past weeks I have been traveling extensively.

@blag (Collaborator) commented Apr 6, 2017

@george-silva This looks good so far, except for the changes in test_project/test_app/settings.py. Is there some reason you're changing those in this PR?

I'm still interviewing for jobs, but I might have time to flesh out the import script a bit more in the next few weeks.

@blag mentioned this pull request Jun 17, 2017
@@ -86,6 +88,8 @@ def save(self, *args, **kwargs):
class BaseContinent(Place, SlugModel):
code = models.CharField(max_length=2, unique=True, db_index=True)

objects = models.GeoManager()

This is obsolete now!
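
(If that refers to the GeoManager line: my understanding is that an explicit GeoManager stopped being necessary in Django 1.9 and the class was removed in Django 2.0, so a newer sketch would simply drop it. A hypothetical stand-in model, assuming Django >= 1.9:)

from django.contrib.gis.db import models

class Continent(models.Model):  # hypothetical stand-in for BaseContinent
    code = models.CharField(max_length=2, unique=True, db_index=True)
    boundary = models.MultiPolygonField(null=True, blank=True)
    # no "objects = models.GeoManager()" needed; the default manager handles spatial lookups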

@adamhaney (Collaborator) commented

Hello all, I've just started helping out with project maintenance and I'd like to ask: is this PR dead? If someone is still working on it I'll gladly keep it open, but otherwise I'm going to close it to clean up dangling PRs. If I don't hear back in the next 7 days I'll assume that this work has been abandoned.

Thanks,

Adam

@adamhaney added the "Awaiting Developer Update" label Feb 25, 2019